for computing the above events from data and a utility
function only. The main contributions of this article are:

• The way information is encoded in the utility function,
which represents a new and clear way to represent the
user's knowledge independently of the system's
intrinsic dynamics.

• The decisional states concept, allowing a modeller to
represent system states with equivalent decisions for
the user based on the preceding utility function.

• The practical algorithm for computing the system
states from data, which is also applicable to the
reconstruction of e-machines [S].

Section 2 relates the context in which we pose the above
problem informally. Section 3 introduces formally how
the knowledge brought by a utility function can be used
in order to compute the internal states of the process
leading to the same decisions. Section 4 gives mathematical
examples of the theory introduced in Section 3.
Section 5 details how to effectively compute decisional
and other utility-related process states from data.
Section 6 gives application examples, including inferring hidden
states from symbolic time series, detecting patterns
in cellular automata and edges in images. A general
conclusion is then given, followed by an appendix providing
more details about predictions in a physical context and
a second appendix highlighting the differences and
commonalities with previous approaches.

Free/libre source code is available and a link to the reference
implementation is given at the end of the document.

2 Background information 

The idea of using the expected utility in order to determine
which decision to take is not new^1, and the first
contribution of this paper listed in the introduction lies in the
way the utility function is defined. Usually the utility is
defined as a real-valued function associated with each outcome,
quantifying the user's interest in that outcome.
Probability theory is then used to estimate the expected
utility one may get when taking different actions. The



Abstract 

This article introduces the decisional states of a system and
provides a practical algorithm for computing them. The
decisional states are defined as the internal states of a 
system that lead to the same decision, based on a user- 
provided utility or pay-off function. The utility function
encodes some a priori knowledge external to the system:
it quantifies how bad it is to make mistakes. The intrinsic
underlying structure of the system is modeled by an e- 
machine and its causal states. The decisional states are 
the emerging patterns corresponding to the utility func- 
tion. In a complex systems perspective, these patterns 
thus form a partition of the lower-level system states that 
is defined according to the higher-level user's knowledge. 
The transitions between these decisional states correspond 
to events that lead to a change of decision. An algorithm 
is provided so as to estimate the states and their trans- 
itions from data. Application examples are given for hid- 
den model reconstruction, cellular automata filtering, and 
edge detection in images. 

Keywords: e-machines; decisional states; utility. 

1 Motivation 

We are monitoring a system, and we are given a utility/cost
function for comparing predictions made about
this system with what really happens. For example, we
are monitoring the weather. We have a pay-off function
U(y, z) related to setting up a piece of equipment outdoors,
with y the weather we predict when taking our decision and
z what really happens. We benefit from the equipment
when it is outside and the weather is good, so
U(y = sunny, z = sunny) = 1, while we gain nothing when
it is inside and it is raining: U(y = rain, z = rain) = 0.
We miss an opportunity when we keep the equipment indoors
when it could have been useful, so U(y = rain, z =
sunny) = -1. The equipment gets damaged in the
rain, so U(y = sunny, z = rain) = -2. We would like
events telling us when to set up the equipment or not,
based solely on the current system state x. These events
are determined by maximising the expected utility of our
predictions y based on x.
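
The weather scenario can be made concrete with a few lines of Python. This is only an illustrative sketch: the pay-off values are those given above, while the function names and the forecast probabilities passed in are hypothetical and not taken from the reference implementation.

```python
# Pay-off table U[(prediction y, outcome z)] from the weather example.
U = {
    ("sunny", "sunny"): 1,   # equipment outside, good weather
    ("rain", "rain"): 0,     # equipment inside, raining
    ("rain", "sunny"): -1,   # missed opportunity
    ("sunny", "rain"): -2,   # equipment damaged
}
OUTCOMES = ("sunny", "rain")


def expected_utility(y, p_z_given_x):
    """Expected pay-off of acting on prediction y, given p(z|x)."""
    return sum(U[(y, z)] * p_z_given_x[z] for z in OUTCOMES)


def best_predictions(p_z_given_x, tol=1e-12):
    """Return the set of optimal predictions and the utility they reach."""
    scores = {y: expected_utility(y, p_z_given_x) for y in OUTCOMES}
    best = max(scores.values())
    return {y for y, s in scores.items() if abs(s - best) <= tol}, best


# Hypothetical current state x for which sun is expected with probability 0.8.
predictions, value = best_predictions({"sunny": 0.8, "rain": 0.2})
```

With this pay-off table the two predictions are tied when sun and rain are equally likely; an event in the sense used here occurs whenever the system state moves across that tie.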

This simple scenario is easy to transpose to more elaborate
contexts. This article presents the theoretical background
for this problem, as well as a concrete algorithm



1 The Wikipedia entry http://en.wikipedia.org/wiki/Expected_utility
traces the history of the concept back to Bernoulli's work in the
18th century.
action with maximal expected utility is then usually retained,
although risk-aversion effects are sometimes taken
into account. There is an abundant literature on probability
and decision theory [5] [23] which details these
ideas. Markov Decision Processes [10] [7] are a successful
framework for modelling a system using this kind of utility
function.

The present article introduces utility functions not on 
the outcomes themselves, but rather on the effect of acting 
according to predictions. This is a more natural way of
thinking in many contexts, as illustrated by the weather
forecasting scenario in the introduction: the user tries to predict
the future and acts accordingly. The utility is determined
by the consequences of what really happens compared
to the predictions, hence it is a function of two variables.
The utility quantifies the consequences of making mistakes 
or being right about what happens next. 

The basis and theoretical foundation for the present context
is thus rooted in predictive models. The e-machine
is precisely such a model, where the system states are
clustered in equivalence classes for making predictions
[17] [8]. The e-machine is also a Markov model, in the
sense that transitions between states do not depend on
which other state was previously visited. e-machines are
in addition the minimal Markov automaton leading to optimal
predictions while keeping deterministic transitions.
See also [16] and [13] for other classes of Markov models
with different properties. For readers familiar with Hidden
Markov Models (HMM) and Markov Decision Processes (MDP),
Appendix B highlights the differences between a classical
HMM and the e-machine.

When dealing with utility functions based on the effect 
of predictions the e-machine naturally becomes the under- 
lying model that is inferred from data. In other words 
the utility function determines a structure corresponding 
to the user knowledge on top of the e-machine, while the 
e-machine itself represents the system's internal relations 
independently of the user. This clear separation of internal 
structure vs. external knowledge is a neat secondary ef- 
fect of defining the utility in terms of the effect of user 
predictions instead of attributing a utility directly to each 
outcome. 

The third contribution of this paper listed in the introduction
is to present a family of algorithms (with a practical
implementation) for reconstructing e-machines and their
extension to the new framework introduced in this document
(see Section 5). This new family of algorithms offers
more flexibility in its data representation and choice of
parameters than the previous one [19], while providing a
computational performance that makes it suitable for a
large class of practical applications (see the examples in
Section 6). It is possible to call only the e-machine
reconstruction part (see Section 6.1) of the algorithm, and



of what the framework proposed in this document is about. 
The next section describes the framework formally. 

3 Decisional states framework 

3.1 General problem targeted by the proposed framework

Let X be a measurable space comprising configurations 
x of the system under investigation. Let Z be a measur- 
able space of all entities that we wish to predict from the 
current system state. For example: 

• In a symbolic series context, X is the set of all past
strings up to the current point ($x^- \in A^*$ in Appendix
B), and Z is the set of all future strings after the
current point ($x^+ \in A^*$ in Appendix B). The concrete
example in Section 6.1 highlights this case.

• More generally, for temporal systems with a state space,
an $x \in X$ should include all causal influences from the
past that might possibly affect the present (i.e. a past
light cone). Similarly, each $z \in Z$ is a future light
cone. Appendix A details this approach, and the
concrete example in Section 6.2 highlights this case.

• In the case of a non-temporal system, X is defined
as the relevant space of parameters that have an
influence on the system state at the point under
investigation, and similarly Z is the space of
parameters influenced by X. A common example in
physics is the Markov Random Field representation
of lattice systems [1]. In an image context X is the
neighbourhood of a given pixel up to a range that we
assume determines the statistical distribution of that
pixel, and Z is the value of the pixel [1]. The concrete
example in Section 6.3 highlights this case.



thus apply it to other frameworks than the one presently 
considered. 

This section aimed at giving the reader an intuitive idea 



For each configuration $x \in X$, we would like to associate a
prediction $y_x \in Z$ amongst all possible outcomes. The actual
outcome $z \in Z$ can differ from $y_x$: we have a range of
possible $z \in Z$, and they occur with a probability distribution
$p(Z|x)$.

Let us now consider that the loss incurred by having
acted according to prediction y when z is the future that
actually happens is quantified by $L(y, z)$, independently
of the particular x for which y was chosen instead of z (so
L is a real-valued function defined on $Z^2$). We could equivalently
define a utility function with $U(y, z) = -L(y, z)$.
Minimising the loss is equivalent to maximising the utility;
both concepts will be used interchangeably when needed.
An important difference between the present context and 
MDP [TU1 l2"i] is thus that utility functions have two ar- 
guments: the utility quantifies our knowledge of how bad 
it is to make mistakes. Actions are based on predictions 
on what we think will happen, and are thus mapped to 
subsets of possible futures (ex: "going out for a hike" is 
mapped to "it won't rain in the next hours"). Actions
are implicit in the utility function: the utility function
quantifies the effect of having taken an action based on a
prediction y, while z actually happens. It is not a globally
defined quantity at the current system state.

We can recover a global quantity by computing the ex- 
pected utility, integrated over all possible futures that may 
happen. The expected utility, when it exists, is: 

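$$E[U] = \int_{z \in Z} \int_{x \in X} U(y_x, z)\, p(x, z)\, dx\, dz$$

or in a discrete scenario:

$$E[U] = \sum_{x \in X} \sum_{z \in Z} U(y_x, z)\, p(x, z)$$

And for all $x \in X$ with non-null probability^2:

$$E[U] = \sum_{x \in X} p(x) \sum_{z \in Z} U(y_x, z)\, p(z|x) \qquad (1)$$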

The goal is usually to find a function $y_x$ that maximises
the expected utility: this would correspond to making
the best predictions on average (and implicitly acting
accordingly). By analogy with causal states [5] we now
cluster together configurations x according to their statistical
properties and look at conditions for which these
clusterings lead to maximal expected utility $E[U]$.

The user can then decide on an action based on the set 
of predictions leading to maximal expected utility, and / or 
based on the value of the utility itself. 



sliding window. Alternatively, other frameworks like In- 
teractive Learning might be better suited for these 
situations. 

Let us now recall the causal states construction [17].
Appendix A explains in more detail how the notion
might be derived in a physical context:

Causal state equivalence relation: $x_1 \equiv x_2$ if, and
only if, the conditional distributions $p(Z|x_1) = p(Z|x_2)$
are the same. The equivalence classes $\sigma(x) =
\{w : p(Z|w) = p(Z|x)\}$ are called the causal states.
See Appendix A for a discussion on this term.

By analogy with the causal states construction, let us now 
define the following equivalence relations: 

Utility equivalence relation: $x_1 \overset{u}{\equiv} x_2$ if, and only if,
$\max_{y \in Z} U(y|x_1) = \max_{y \in Z} U(y|x_2)$. That is, the
maximal expected utility is the same at points $x_1$ and
$x_2$, even if the sets of optimal predictions $Y(x_1) =
\arg\max_{y \in Z} U(y|x_1)$ and $Y(x_2)$ that induce this
utility might differ for $x_1$ and $x_2$.

Prediction equivalence relation: $x_1 \overset{p}{\equiv} x_2$ if, and only
if, $\arg\max_{y \in Z} U(y|x_1) = \arg\max_{y \in Z} U(y|x_2)$. That
is, the sets of optimal predictions $Y(x_1) = Y(x_2)$ are
the same, even if the utility induced by these predictions
might differ for $x_1$ and $x_2$.



3.2 Equivalence relations 

$E[U]$ is maximal when each term $T(x, y) =
p(x) \sum_{z \in Z} U(y, z)\, p(z|x)$ is maximal (see Eq. 1). Since
$p(x)$ is constant for a given $T(x, y)$, and assuming we can
choose the y for each x independently, maximising T is
equivalent to maximising each $\sum_{z \in Z} U(y, z)\, p(z|x)$. Let
us note $U(y|x) = E_{z \in Z}[U(y, z)|x] = \sum_{z \in Z} U(y, z)\, p(z|x)$
the expected utility of choosing the prediction y for a
given x.

Another assumption is implicit in this argument:
that making a decision does not modify the system. The
weather forecasting example in the introduction falls in
this category. However, sometimes taking a decision does modify
the system, for example when monitoring a patient's
health in order to decide whether or not to administer a drug.
In that case we have to rely on approximations
(usually an additional assumption that the change is effective
only at a different time scale than that of the observations)
so we can still aggregate them on a recent past
2 Technically we should introduce here a set $X' = X \setminus \{x : p(x) =
0\}$ of all $x \in X$ with non-null probabilities. In practice we are dealing
with observed system configurations with non-null probabilities, and
will act as if $X' = X$.



Let us call iso-utility states $\upsilon \in \Upsilon$ and iso-prediction
states $\psi \in \Psi$ the partitions of X corresponding to these
equivalence relations: $\upsilon(x) = \{w : w \overset{u}{\equiv} x\}$ and
$\psi(x) = \{w : w \overset{p}{\equiv} x\}$.

Let us call decisional states $\omega \in \Omega$ the intersection of
both: $\omega(x) = \{w : w \overset{u}{\equiv} x \text{ and } w \overset{p}{\equiv} x\}$. When both the
expected utility and the optimal predictions are the same,
we assume the decisions that are taken on the system are
the same, hence the name. In other words, we suppose
the utility function encodes all that a user needs to take a
decision.
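
The three partitions can be illustrated on a small discrete system. The following Python sketch is not the reference implementation: the function names are hypothetical, the conditional distributions p(z|x) are assumed to be known exactly, and utility values are matched by simple rounding rather than by a proper clustering step.

```python
from collections import defaultdict


def expected_utility(U, p_z, y):
    """U(y|x) = sum over z of U(y, z) p(z|x), with p_z the table p(z|x)."""
    return sum(U(y, z) * p for z, p in p_z.items())


def optimal(U, p_z, candidates, ndigits=9):
    """Return (maximal expected utility, set Y(x) of optimal predictions)."""
    scores = {y: round(expected_utility(U, p_z, y), ndigits) for y in candidates}
    best = max(scores.values())
    return best, frozenset(y for y, s in scores.items() if s == best)


def partition(keys):
    """Group configurations x that share the same key."""
    groups = defaultdict(set)
    for x, key in keys.items():
        groups[key].add(x)
    return sorted(sorted(g) for g in groups.values())


def decisional_partitions(U, cond, candidates):
    """cond maps each configuration x to its distribution p(z|x)."""
    opt = {x: optimal(U, p_z, candidates) for x, p_z in cond.items()}
    iso_utility = partition({x: u for x, (u, _) in opt.items()})
    iso_prediction = partition({x: Y for x, (_, Y) in opt.items()})
    decisional = partition(opt)  # same utility AND same optimal predictions
    return iso_utility, iso_prediction, decisional
```

For instance, with a utility of 1 for a correct prediction and 0 otherwise, two configurations with distributions (0.9, 0.1) and (0.1, 0.9) over two outcomes share an iso-utility state but not an iso-prediction state, so they fall into different decisional states.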

These equivalence relations partition the configuration
space X into clusters, with the corresponding properties
common to all points in the cluster. It should be noted
that $E[U]$ as defined on the whole space does not consider
which specific decisional state the process is in. Knowing
which is the current cluster for any given point x allows
us to refine the expected utility to a local value (with
$x \in \upsilon$ the iso-utility state) and to determine which decision
to take to reach this utility (by refining again to the decisional state).
Section 3.4 details how to derive notions of complexity
from these local expected values.
[Figure 1 diagram. Labels: data values x; causal state, $p(Z|x) = p(Z|\sigma)$; iso-utility state, $u = \max_{y \in Z} U(y|x)$; iso-prediction state, $Y = \arg\max_{y \in Z} U(y|x)$; decisional state, ds = ius ∩ ips.]

Figure 1: Relations between the different states.

3.3 Relation between the causal, iso-utility, iso-prediction and decisional states

Let $x_1$ and $x_2$ be in the same causal state $\sigma$. Then by
definition of the causal states $p(Z|x_1) = p(Z|x_2)$. In that
case, the expected utility of any prediction $y \in Z$ is the
same for $x_1$ and $x_2$: $U(y|x_1) = \sum_{z \in Z} U(y, z)\, p(z|x_1) =
\sum_{z \in Z} U(y, z)\, p(z|x_2) = U(y|x_2)$. Therefore the optimal
predictions and induced utilities are the same: $x_1 \overset{p}{\equiv} x_2$
and $x_1 \overset{u}{\equiv} x_2$, and so is the combination of both.

Thus the causal states sub-partition each of the iso-utility,
iso-prediction and decisional states.

The converse is not true: we can have two distinct
causal states $\sigma_1$ and $\sigma_2$ with the same maximum value
of $U(y|x) = \sum_{z \in Z} U(y, z)\, p(z|x)$ at the same y points,
but with different $p(z|x)$ for at least one $z \in Z$.

Figure 1 shows the relations between the different states
defined on the process.

3.4 Transition graphs 

In the discrete case the causal states form a deterministic
automaton, the e-machine [5]. Since the causal states
sub-partition the other states as mentioned in the previous
section, the iso-utility and iso-prediction states (and their
intersections) also form automata: these are coarser-grained
versions of the e-machine. Figure 2 shows decisional states
gathering causal states of an underlying e-machine, with
iso-utility and iso-prediction states on top of the decisional
states.

Let x be the current system configuration, and $\omega$ the
current system decisional state (so $x \in \omega$). Let $a \in A$
be the next observed symbol, in a discrete scenario with
alphabet A. Then $xa = w \in X$ is the system configuration
after the observation. Let $\gamma$ be the decisional state for
w. Then $p(e_{\omega \to \gamma}) = \sum_{x \in \omega} \sum_{a \in A} p(xa \in \gamma | x \in \omega)$ is the
probability of the transition event $e_{\omega \to \gamma}$ from state $\omega$ to
state $\gamma$. The same construction also works for iso-utility
and iso-prediction states.
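
As an illustration, transition probabilities between decisional states can be estimated from a time series once each configuration has been labelled with its state. The sketch below uses plain empirical frequencies, which is an assumption made for this example and not necessarily how the reference implementation proceeds; the function names are hypothetical.

```python
from collections import Counter


def transition_probabilities(states):
    """Empirical p(next state | current state) from a sequence of state labels.

    `states` holds one decisional-state label per time step. Observed symbols
    are deliberately ignored: only the change of state matters here.
    """
    pair_counts = Counter(zip(states, states[1:]))
    source_counts = Counter(states[:-1])
    return {(s, t): n / source_counts[s] for (s, t), n in pair_counts.items()}


def transition_events(states):
    """Time steps at which the decisional state changes, as (time, from, to)."""
    return [(i + 1, s, t)
            for i, (s, t) in enumerate(zip(states, states[1:]))
            if s != t]
```

The same two functions apply unchanged to sequences of iso-utility or iso-prediction labels.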



[Figure 2 diagram. Labels: an iso-utility state; decisional state $\omega_1$, optimal decisions {a, b}, $E[U] = u$; decisional state $\omega_2$, optimal decisions {a}, $E[U] = u$; a further decisional state, optimal decisions {c, d}, $E[U] = t$; causal state $\sigma_1$, $p(Z|x) = p(Z|\sigma_1)$.]

Figure 2: Decisional states transition graph on top of the
e-machine.



In the case of causal states the transition events in the 
e-machines are further refined by labelling the transitions 
with the involved symbol in A: e"^. However in the 
present case this is not necessary: 

• Iso-utility state transitions are events that change the
expected utility, irrespective of the implied symbols.
Several causal states might belong to the same
iso-utility state, as depicted in Fig. 1.

• Iso-prediction state transitions are events that change 
the possible optimal prediction choice, with the same 
comment. 

• Decisional state transitions change at least one of the 
above. 

Moreover, as shown in Figure 2, the transitions between
decisional states might cover several transitions between
different causal states that sub-partition the decisional
ones, in addition to different symbols. While for the
e-machine the transitions are defined using the symbols of
the discrete alphabet, in the present case there can be at
most one transition from one state to another, irrespective
of the implied symbols.

In computational mechanics [S] the mutual information
$C = I(x; \sigma)$ between a configuration x and the causal state
$\sigma$ for x is referred to as the statistical complexity. In the
present context we might define by analogy a decisional
complexity D as the amount of information necessary to
retain about the configuration of a process in order to be
able to make an optimal decision, given a utility function.
Once that utility function is fixed we can compute
the decisional states and define $D = I(x; \omega)$ where x is a
configuration of the system and $\omega$ the decisional state for x.



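In the discrete case, by definition $D = I(x; \omega) =
H(\omega) - H(\omega|x)$ with H proper entropies (differential
entropies in the continuous case). But then $H(\omega|x) = 0$
by construction of the $\omega$, which in the discrete case
leads to well-defined transition graphs. Thus $D = H(\omega)$
in this case, the amount of information necessary to
encode the decisional states. Each decisional state gathers
causal states, so $p(\omega) = \sum_{\sigma \subset \omega} p(\sigma)$ and

$$-p(\omega) \log p(\omega) = -\sum_{\sigma \subset \omega} p(\sigma) \log p(\omega) \le -\sum_{\sigma \subset \omega} p(\sigma) \log p(\sigma)$$

because $-\log$ is monotonically decreasing. Hence $D \le C$
in the discrete case.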
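
The inequality can be checked numerically. In the Python sketch below, the function names, the base-2 logarithm, and the example probabilities are assumptions made for illustration; C is computed as the entropy of the causal states by the same argument that gives D as the entropy of the decisional states.

```python
import math
from collections import defaultdict


def entropy(probs):
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def complexities(p_sigma, sigma_to_omega):
    """Return (C, D) for causal-state probabilities and a merge map.

    p_sigma maps each causal state to its probability; sigma_to_omega maps
    each causal state to the decisional state that contains it.
    """
    p_omega = defaultdict(float)
    for sigma, p in p_sigma.items():
        p_omega[sigma_to_omega[sigma]] += p
    return entropy(p_sigma.values()), entropy(p_omega.values())
```

Merging causal states can only lower the entropy, so D equals C exactly when every decisional state contains a single causal state.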

The same construction also works for defining similar 
quantities: 

• $P = I(x; \psi)$ is the difficulty to get the best predictions,
tentatively called here the optimal prediction
complexity.

• $V = I(x; \upsilon)$ is the difficulty to estimate the expected
utility of x.

3.5 Interpretation and notes 

Decisional states are equivalent to merging those causal
states which lead to the same decisions relative to our
utility function. In this case the causal states have lost
their maximality property because we are only interested
in making a prediction and not in keeping the full
conditional distributions. We have, in the general case,
clustered together the causal states that lead to the same 
optimal predictions and maximal expected utility value, 
based on a given utility function. 

Conversely this defines an equivalence relation amongst
utility functions: two utility functions $U_1$ and $U_2$ are
equivalent when they induce the same clustering of causal 
states into the decisional ones, with the same expected val- 
ues and optimal predictions. These utility functions would 
induce the same decisions in a system: they are function- 
ally equivalent. Isomorphisms between utility functions 
leading to the same predictions but with different utility 
values could also be defined: these are transformations 
of utility functions that preserve the iso-prediction states. 
Similarly, transformations could be defined that only pre- 
serve the iso-utility states. 

The transitions between the iso-utility states correspond 
to events that provoke a change in the expected utility of 
the system. Identifying these events might become a cru- 
cial practical application, for example for detecting when 
the expected utility reaches a predefined threshold. 

The transitions between the iso-prediction states cor- 
respond to events that provoke a change in the optimal 
predictions that can be chosen. Similarly, a user might 
be interested in monitoring these changes, for example, to 
maintain the current action as long as it is appropriate (as 
long as it matches one of the possible predictions for the 
system's evolving iso-prediction state). 

The hypothesis made here is that when the cost/utility
is defined in terms of a functional (high-level) value, that is, when
it has a meaning in high-level terms, then the transition
events also correspond to interesting high-level objects
to look at. This might form the basis for an automated
search for meaningful events in a given system's
evolution.

In any case, the utility function encodes external in- 
formation not available in the original data. So long as 
one stays with causal states, only information present in 
the low-level data can be obtained. Much like introducing 
a prior in a Bayesian framework, here the utility func- 
tion can be seen as encoding an a priori information not 
available in the original data. This has at least three con- 
sequences from an emergentist point of view: 

• The causal states represent the finest scale at which 
we can meaningfully associate a utility function and 
take decisions. Since the decisional states are super- 
sets of the causal states, then any partition of the 
data defined with respect to a utility function can- 
not go below that scale, whatever the chosen utility 
function. 

• Macro-level information need not be computationally
reducible [3] to the lower level in order to be
incorporated: the utility function is defined on $Z^2$, not
X, and it can possibly be incompressible, stated as
a value table and not explicitly computed in terms
of the lower-level scale. The data $x \in X$ is then
clustered into sets which need not have a meaning
defined at that level.

• If the hypothesis that "emergent structures are sub-machines
of the e-machine" [17, sec. 11.2.2] is correct,
then the decisional states are the emergent structures 
corresponding to a given utility function. Rather than 
looking for emergent entities directly we might then 
encode our knowledge in a utility function, and look 
at the decisional states in order to find good emergent 
entity candidates. If these do not suffice, we might 
then refine the utility function iteratively. 

Finally, it should be noted the utility function is not the 
only source of external introduction of knowledge in the 
system. Additional assumptions are made either implicitly 
or explicitly if the system is able to generalise to unknown 
values. For example, the hypothesis that $p(Z|x)$ can be decomposed
using kernels or Bayesian networks could be one
such assumption. The accuracy of the proposed method 
for finding decisional states depends on how well these ex- 
tra assumptions are verified, independently of the chosen 
utility function. 

In particular, results where a bad utility value is ex- 
pected might be either intrinsic to the data (what we 
would like to detect) or an artifact indicating the general- 
isation/sampling/etc. assumptions are not appropriate at 
this point. Running the same data with different general- 
isation methods implying different assumptions would be 
a way to detect these artifacts, provided enough computational
power is available. But this is not sufficient: some mathematical
precondition might still not be satisfied. For systems that are
not conditionally stationary in particular, it is not possible
to aggregate observations taken at different times. In that
case conditional stationarity might still be assumed as an
approximation on a recent past sliding window, hopefully
large enough to allow the collection of significant statistics
on the $p(Z|x)$. But points that fall beyond this window as
time progresses should be removed.

The algorithm proposed in Section [5] does not handle 
the verification of preconditions, which are expected to 
be performed by the user depending on the context (ex: 
nature of the data). However the reference implementa- 
tion (link given in Appendix C) is fully generic and allows 
testing different sampling and generalisation methods if 
needed. 



4 Analytic examples 

4.1 Example 1: when bad predictions are useless

In this subsection utility is given to a prediction only if
it is correct; otherwise, the prediction is declared useless:
$U(z, z) = 1$ and $U(y, z \neq y) = 0$. In a continuous scenario
the delta function $U(y, z) = \delta(y, z)$ is used instead.
From Section 3.2,

$$U(y|x) = \sum_{z \in Z} U(y, z)\, p(z|x)$$

then becomes:

$$U(y|x) = p(y|x)$$

The set $Y(x)$ of predictions y realising an optimal gain
becomes:

$$Y(x) = \{y : p(y|x) = \max_{z \in Z} p(z|x)\}$$

And Eq. 1 leads to:

$$E_{\max}[U] = \sum_{x \in X} p(x) \max_{z \in Z} p(z|x)$$

But for each causal state $\sigma \subset \omega$ in each decisional state
$\omega$ the conditional probability $p(Z|x) = p(Z|\sigma)$ is constant
for each $x \in \sigma$. The decisional states are found by
gathering causal states with the same maxima points y for
$p(Z|\omega)$. We can then write in this special case the above
formula as:

$$E_{\max}[U] = \sum_{\omega \in \Omega} \sum_{\sigma \subset \omega} p(\sigma)\, p(y_\omega|\sigma)$$

where $y_\omega$ is taken as any maximum of $p(Z|\sigma)$ common to all
$\sigma \subset \omega$.



Under the condition $U(z, z) = 1$ and $U(y, z \neq y) = 0$
(or $U(y, z) = \delta(y, z)$ in the continuous case) the full
conditional probability distributions $p(Z|x)$ do not matter;
what is important is that these distributions peak at the
same maxima.
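
A short Python sketch of this special case follows. It assumes discrete outcomes and exactly known distributions; the function names and the numbers used in the usage note are hypothetical.

```python
def optimal_predictions(p_z, tol=1e-12):
    """Y(x) for the 0/1 utility: the modes of p(z|x), with their probability."""
    best = max(p_z.values())
    return {z for z, p in p_z.items() if best - p <= tol}, best


def max_expected_utility(p_x, cond):
    """E_max[U] = sum over x of p(x) max_z p(z|x) for the 0/1 utility."""
    return sum(p_x[x] * max(cond[x].values()) for x in p_x)
```

Two configurations with distributions (0.9, 0.1) and (0.6, 0.4) over the same two outcomes have the same set of optimal predictions even though their full distributions differ, so they share an iso-prediction state; their maximal expected utilities (0.9 and 0.6) differ, so they belong to different iso-utility states.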

4.2 Example 2: Loss defined by squared error

This section investigates the case where the loss function
$L(y, z)$ can be written as a squared difference between
the actual event and the prediction: $L(y, z) = (z - y)^2$,
provided this operation is meaningful in Z.

We re-develop the treatment from [5, section 1.5.5] in
our new context:

With the above loss function Eq. 1 becomes:

$$E[L] = \int_{z \in Z} \int_{x \in X} L(y_x, z)\, p(x, z)\, dx\, dz$$

$$E[L] = \int_{z \in Z} \int_{x \in X} (z - y_x)^2\, p(x, z)\, dx\, dz$$

As in Section 3.1 the goal is to find a function $y_x$ that
minimises the expected loss: $E_{\min}[L]$.

The extrema of $E[L]$ are given by the functional equation
$\frac{\delta E[L]}{\delta y_x} = 0$, with:

$$\frac{\delta E[L]}{\delta y_x} = -2 \int_{z \in Z} (z - y_x)\, p(x, z)\, dz$$

Solving $\frac{\delta E[L]}{\delta y_x} = 0$ gives:

$$\int_{z \in Z} z\, p(x, z)\, dz = y_x \int_{z \in Z} p(x, z)\, dz$$

So except for a set of x with null probability mass:

$$p(x) \int_{z \in Z} z\, p(z|x)\, dz = y_x\, p(x)$$

$$y_x = E_{z \in Z}[z|x] \qquad (2)$$

For a given causal state $\sigma$, $p(Z|x)$ is constant for $x \in \sigma$
so we can write $y_\sigma = E_{z \in Z}[z|\sigma]$.

The decisional states are in this example obtained by 
clustering together the causal states with the same expec- 
ted value of z within the state. 
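
Eq. 2 can be checked numerically on a small discrete distribution. The sketch below is illustrative only: the distribution and the search grid are arbitrary choices, and a brute-force grid search stands in for a real optimiser.

```python
def expected_loss(y, p_z):
    """E[(z - y)^2 | x] for a discrete conditional distribution p(z|x)."""
    return sum(p * (z - y) ** 2 for z, p in p_z.items())


def conditional_mean(p_z):
    """E[z | x], the minimiser predicted by Eq. 2."""
    return sum(p * z for z, p in p_z.items())


# Arbitrary example distribution p(z|x) over three numeric outcomes.
p_z = {0.0: 0.2, 1.0: 0.5, 4.0: 0.3}
y_star = conditional_mean(p_z)

# Brute-force search over candidate predictions 0.00, 0.01, ..., 4.00.
grid = [i / 100 for i in range(401)]
y_grid = min(grid, key=lambda y: expected_loss(y, p_z))
```

The grid minimiser coincides with the conditional mean, so two causal states with different distributions but the same mean lead to the same optimal prediction.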

These results are obtained because the utility function 
can be treated analytically; in the general case we do 
not have such simple formula available. The next sec- 
tion presents an algorithm that can infer the structure of 
the decisional states from observed data and numerical 
integration. 
5 Estimating the decisional states from data

5.1 General presentation of the algorithm 

There are two distinct tasks the algorithm must perform: 

• Estimating the probability distributions $p(Z|x)$ and
$p(X)$ from data. The probability distribution estimator
must act on the whole space X. It is responsible
for providing values for unobserved data (generalisation
ability). It might use all available observations:
$p(Z|x) = F(O)$, where $O = \{(x_i, z_i)_{i=1 \ldots N}\}$ represents
the data in the form of observation pairs $(x_i, z_i)$,
and F is a generic function.

• Clustering X into causal, iso-prediction, iso-utility
and decisional states according to the user needs.
This implies as a sub-task estimating the maxima
of $U(y|x)$. The first step is to build $U(y|x) =
\int_{z} p(z|x)\, U(y, z)\, dz$ with a user-provided integrator
and the utility function. Then, an optimiser might be
invoked so as to compute $Y(x) = \mathrm{Argmax}_{y \in Z} U(y|x)$.

So, in summary, the user must provide: 

• A probability density estimator $p(Z|X)$ from data
observations O.

• A utility function U acting on $Z^2$ with Z the space
of predictions.

• An integrator for computing the expected value of U
with respect to the estimated density.

• A multi-modal optimiser in order to compute $Y(x) =
\mathrm{Argmax}_{y \in Z} U(y|x)$.

• A clustering algorithm for gathering probability
distributions (for causal states), utility values (for
iso-utility states), or similar sets $Y(x)$ (for iso-prediction
states). Decisional states are found by intersection of
the iso-utility and iso-prediction states.

• Optionally, the user may associate a symbol to each
pair $(x_t, x_{t+1})$ of consecutive configurations. This step is
detailed in Section 5.5 once the main algorithm is
explained and we can see why and when this step
may become necessary.

Figure 3 recapitulates these points and shows the algorithm
steps that will be detailed in the next sections.

The reference C++ algorithm implementation (see
Appendix C) handles common cases by proposing discrete
and kernel density estimation, deterministic or
Monte-Carlo integration, an exhaustive optimiser especially
suited to small search spaces and a simple Genetic
Algorithm for larger spaces, and a single-link hierarchical
clustering algorithm that finds connected components


Data inputs:

- Pairs of observations $O = \{(x_i, z_i)_{i=1 \ldots N}\}$;

- (Optional): Symbols $(s_i)_{i=1 \ldots N-1}$ for each transition
$(x_i, x_{i+1})_{i=1 \ldots N-1}$;

- Parameters for the functional inputs (ex: threshold for matching
probability distributions).

Functional inputs:

- A probability density estimator PDE such that $p(Z|X) =
\mathrm{PDE}(O)$. The distribution type is user-defined;

- A utility function $U : Z^2 \to \mathbb{R}$;

- An integrator Integ over Z;

- A multi-modal optimiser Argmax over Z;

- A clustering algorithm C1 acting on probability densities $p(Z|X)$;

- A clustering algorithm C2 over subsets of Z;

- A clustering algorithm C3 over $\mathbb{R}$.

Algorithm:

1. Build the density estimates $p(Z|x_i)$ for each $x_i$ in the data set
using PDE.

2. Cluster the density estimates using C1 into causal states $\sigma$.

3. (Optional) Refine the estimates $\sigma$ and loop to step 2 using the
symbols $(s_i)_{i=1 \ldots N-1}$. See Section 5.5.

4. Average out $p(Z|\sigma) = \mathrm{avg}_{x_i \in \sigma}\, p(Z|x_i)$. See Section 5.4.

5. Compute $Y(\sigma) = \mathrm{Argmax}_y\, \mathrm{Integ}_z(U(y, z)\, p(z|\sigma))$ for each
causal state estimate $\sigma$, retaining the utility $U(\sigma)$ obtained
for these maxima.

6. Cluster the causal state estimates using $Y(\sigma)$ and C2 into
iso-prediction estimates $\psi(\sigma) \in \Psi$.

7. Cluster the causal state estimates using $U(\sigma)$ and C3 into
iso-utility estimates $\upsilon(\sigma) \in \Upsilon$.

8. Intersect $\Psi \cap \Upsilon$ into decisional states $\Omega$ that partition X.

9. (Optional) Produce the transition graphs, and the e-machine if
the symbols are available.

10. (Optional) Compute the global complexities of the system C,
D, P, and V from Section 3.4.

11. (Optional) For each $x_i$, compute the local complexity equivalents
of C, D, P, and V at this point (the mutual information
between $x_i$ and $\sigma(x_i)$, $\omega(x_i)$, $\psi(x_i)$ and $\upsilon(x_i)$ respectively).

Figure 3: Decisional state reconstruction algorithm.
with a user-defined match predicate and threshold. The
algorithm is parallelised, allowing the use of multi-core
CPUs. The reference implementation is additionally fully
templatised and generic, allowing users to plug in their
favourite routines: these are the functional inputs of the
algorithm in Fig. 3.
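As an illustration of how the functional inputs fit together, the following minimal Python sketch covers steps 5 to 8 of Fig. 3 for an exhaustively listed prediction space. It is not the reference implementation: the function names, the rounding used in place of the clusterers C2 and C3, and the toy distributions are all assumptions made for this example.

```python
from collections import defaultdict

def expected_utility(y, dist, utility):
    """Integ step: sum of U(y, z) p(z|sigma) over an exhaustively listed Z."""
    return sum(p * utility(y, z) for z, p in dist.items())

def best_predictions(dist, candidates, utility, tol=1e-9):
    """Exhaustive Argmax: every y whose expected utility ties with the maximum."""
    scores = {y: expected_utility(y, dist, utility) for y in candidates}
    top = max(scores.values())
    return frozenset(y for y, u in scores.items() if top - u <= tol), top

def decisional_states(causal_dists, candidates, utility, ndigits=9):
    """Steps 5-8: group causal states sharing both Y(sigma) and U(sigma)."""
    groups = defaultdict(list)
    for sigma, dist in causal_dists.items():
        y_set, u_max = best_predictions(dist, candidates, utility)
        # Assumption: rounding stands in for C2 and C3 (exact-match case).
        # Grouping on the pair (Y, U) is the intersection of both partitions.
        groups[(y_set, round(u_max, ndigits))].append(sigma)
    return sorted(sorted(g) for g in groups.values())

# Hypothetical causal states over Z = {0, 1, 2}; utility 1 for a correct guess.
causal = {
    "s1": {0: 0.6, 1: 0.3, 2: 0.1},
    "s2": {0: 0.6, 1: 0.1, 2: 0.3},
    "s3": {0: 0.1, 1: 0.1, 2: 0.8},
}
hit = lambda y, z: 1 if y == z else 0
states = decisional_states(causal, [0, 1, 2], hit)
```

In this toy case s1 and s2 have different distributions (so they are distinct causal states) but the same best prediction and the same expected utility, so they fall in one decisional state while s3 stays apart.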

5.2 Kernel density estimation 

This section describes one way to perform Step 1 in Fig. 3.

A discrete probability estimator is suitable for small X
spaces where a sufficient amount of data was observed, so
that $p(Z|X)$ can be reliably estimated by counting
occurrences of all x and z. For larger spaces or when unknown
or continuous X might be encountered, the system must
be able to generalise. We now present the case for a Kernel
Density Estimation (KDE) [20] of the probability density
$p(Z|X) = \mathrm{PDE}(O)$.

In general, the kernel $K(a; b)$ with a and b in the joint
space $\{a, b\} \subset X \times Z$ is not separable: the density
estimate is $p(x, z) \propto \sum_{i=1}^{N} K(x_i, z_i; x, z)$, summing over
all observation pairs $(x_i, z_i) \in O$. In the particular
case of separable kernels for the configuration space X
and the prediction space Z we have instead:
$p(x, z) \propto \sum_{i=1}^{N} K_X(x_i; x)\, K_Z(z_i; z)$. Even when the kernel is
separable the user may benefit from the joint kernel approach:
for analysing time series it is natural to consider a moving
window of $\dim(X \times Z)$ values and perform the density
estimation on the joint space. In another example in
Section 6.3 an image is considered as the limit distribution of
a Markov Random Field [1], and the density estimation is
also performed on the joint space (with Z being in that
example the space of pixel values and X the space of pixel
neighbourhoods).

In any case, the conditional probability density is
estimated by integrating out the $p(X)$ factor over Z:
$p(z|x) = p(x, z) / \int_{\zeta \in Z} p(x, \zeta)$. Several sampling mechanisms are
provided over Z for the integration, including exhaustive
listing of Z for small search spaces. The adequate method
depends on the particular user application.

Computing the causal states (and the other states built
on top of the causal states) only requires the conditional
distributions, and not the joint ones. Without loss of
generality it is thus possible to request that $K(a, a) = 1$
with a being x or z or the joint data (x, z) depending
on the above cases. An example is the radial basis
function $K(a, b) = e^{-\|a - b\|^2 / h^2}$ with h the kernel width.
Indeed, dividing by $p(X)$ absorbs the change of scale.
Better numerical accuracy is however achieved by requesting
$K(a, a) = 1$, especially in high dimensions where the
multivariate Normal kernel would lead to very small $K(a, a)$.
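The separable-kernel estimate and the normalisation over Z can be sketched as follows. This is a minimal illustration rather than the reference code: the function names, the brute-force sum over all observations (no nearest-neighbour pruning) and the toy data are assumptions, and the result is normalised over the chosen sample points so it reads as a probability mass over those samples.

```python
import math

def rbf(a, b, h):
    """Radial basis kernel with K(a, a) = 1; a and b are equal-length tuples."""
    d2 = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-d2 / h ** 2)

def conditional_density(x, samples, observations, h_x, h_z):
    """Estimate p(z|x) at each z in `samples` with separable kernels.

    The joint estimate sum_i K_X(x_i; x) K_Z(z_i; z) is divided by its sum
    over the sample points, which absorbs any constant kernel scale.
    Assumes at least one observation has non-negligible weight at x.
    """
    wx = [rbf(xi, x, h_x) for xi, _ in observations]
    joint = [sum(w * rbf(zi, z, h_z) for w, (_, zi) in zip(wx, observations))
             for z in samples]
    total = sum(joint)
    return [j / total for j in joint]

# Hypothetical one-dimensional data: x near 0 is followed by z = 0.
observations = [((0.0,), (0.0,)), ((0.1,), (0.0,)), ((1.0,), (1.0,))]
samples = [(0.0,), (1.0,)]
p = conditional_density((0.0,), samples, observations, h_x=0.3, h_z=0.3)
```

Because only ratios of kernel sums are used, the unit-peak kernel gives the same conditional estimate as a properly normalised one while avoiding very small intermediate values.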

The discrete case is recovered when choosing the delta
function as a kernel. In that case, similar observations
$(x_i, z_i)$ are effectively summed up for a given $x_i$ and the
probability estimator is a histogram. In practice it is
preferable to use a specialised discrete estimator
implementation for efficiency reasons in small search spaces.

Finally, the kernel width h can be chosen according to
a variety of estimators from the data [5]. In practice it
has been observed that the results ultimately depend on the
final task to which the algorithm is applied. h is then
considered as a free parameter, which can be determined
for example by cross-validation or by using a genetic
algorithm. This gave the best results for classification tasks
based on the decisional complexity feature (unpublished
author work in progress, see footnote 3). A hypothesis is that
while the kernel width h found this way does not realise an a
priori form of optimum (like the AMISE [9]), it realises an a
posteriori ideal compromise between bias and variance in
the estimated density for the particular task the algorithm
is applied to. This is similar to the approach in [3] except
that we have reduced the meta-parameter search to h and
got rid of the histogram boundaries by using a KDE.

The default implementation proposes a reasonable
choice based on the average distance between nearest data
points, on which the aforementioned cross-validation
and search techniques can build.
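A starting value of this kind could be computed as in the sketch below. The exact rule used by the reference implementation is not specified here, so taking h equal to the mean nearest-neighbour distance is an assumption, as are the function name and the brute-force O(N^2) search.

```python
import math

def mean_nearest_neighbour_distance(points):
    """Average Euclidean distance from each point to its nearest other point.

    Brute force over all pairs; requires at least two points.
    """
    nearest = []
    for i, p in enumerate(points):
        nearest.append(min(math.dist(p, q)
                           for j, q in enumerate(points) if j != i))
    return sum(nearest) / len(nearest)

# Hypothetical data points in the joint space.
points = [(0.0, 0.0), (3.0, 4.0), (3.0, 5.0)]
h0 = mean_nearest_neighbour_distance(points)  # initial guess for h
```

The value h0 is only an initial guess, to be refined by cross-validation or another search on the final task.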

5.3 Using the probability estimates 

Two operations are performed using $p(Z|x)$:

• Comparison: We need to check whether $p(Z|x_1)$ and
$p(Z|x_2)$ are similar, in order to decide whether or not
to cluster $x_1$ and $x_2$ in the same causal state.

• Expectation: We need to estimate the expected utility
of a prediction $y \in Z$ for a given $x \in X$:
$U(y|x) = \int_{z \in Z} p(z|x)\, U(y, z)$.

Comparison is handled by choosing a similarity measure
between probability distributions. The reference
implementation proposes the $\chi^2$ statistic, the Bhattacharyya,
Variational and Harmonic mean distances, and the
Jensen-Shannon divergence [11]. The Bhattacharyya distance is
the default for the Kernel Density Estimation, and a $\chi^2$
test is the default for the discrete case.
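Two of these measures are sketched below for distributions sampled at the same points. The textbook definitions are used (minus the logarithm of the Bhattacharyya coefficient, and the Jensen-Shannon divergence in bits); whether the reference implementation uses exactly these variants is an assumption.

```python
import math

def bhattacharyya_distance(p, q):
    """-ln of the Bhattacharyya coefficient sum_s sqrt(p(s) q(s))."""
    bc = sum(math.sqrt(a * b) for a, b in zip(p, q))
    return -math.log(bc) if bc > 0 else math.inf

def jensen_shannon_divergence(p, q):
    """Average KL divergence of p and q to their midpoint, in bits."""
    def kl(a, m):
        return sum(ai * math.log2(ai / mi) for ai, mi in zip(a, m) if ai > 0)
    mid = [(a + b) / 2 for a, b in zip(p, q)]
    return 0.5 * kl(p, mid) + 0.5 * kl(q, mid)

# Hypothetical distributions over three sample points.
p1 = [0.7, 0.2, 0.1]
p2 = [0.6, 0.3, 0.1]
d_b = bhattacharyya_distance(p1, p2)
d_js = jensen_shannon_divergence(p1, p2)
```

Either value can then be compared with a user threshold to form the match predicate used for clustering.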

Let $S \subset Z$ be a set of sample points used over Z
for comparing the probability distributions (possibly with
$S = Z$ for an exhaustive approach). Expectation of the
utility for a candidate $y \in Z$ is simply performed
numerically over Z at the chosen sample points $S \subset Z$:
$U(y|x) = \big(\sum_{s \in S} p(s|x)\, U(y, s)\big) / \big(\sum_{s \in S} p(s|x)\big)$.

5.4 Clustering 

Clustering of the causal states is performed directly by 
matching the probability distributions as defined in the 
previous subsection. 

3 Note to the editor: the corresponding preprint will probably be 
linked by reference in the final version of this article. 

For iso-utility and iso-prediction states an additional
step is necessary: optimising $U(y|x)$ in order to find
$Y(x) = \operatorname{Argmax}_{y \in Z} U(y|x)$. Any multi-modal
optimisation scheme can be invoked at this point. Equivalent best
predictions $y \in Y(x)$ must be found, so uni-modal search
schemes returning only one candidate are not suitable.
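For a small candidate set, the expectation over sample points and the search for all equivalent maxima can be sketched as below. The function names and the tie tolerance `tol` are assumptions for this example, and the exhaustive scan stands in for any multi-modal optimiser.

```python
def expected_utility(y, samples, p_given_x, utility):
    """Normalised sum over sample points S of p(s|x) U(y, s)."""
    weights = [p_given_x(s) for s in samples]
    weighted = sum(w * utility(y, s) for w, s in zip(weights, samples))
    return weighted / sum(weights)

def argmax_set(candidates, score, tol=1e-9):
    """Return (Y, best): all candidates whose score is within tol of the maximum."""
    scored = [(y, score(y)) for y in candidates]
    best = max(u for _, u in scored)
    return {y for y, u in scored if best - u <= tol}, best

# Hypothetical bimodal distribution p(z|x) over Z = {0, 1, 2, 3}.
p = {0: 0.4, 1: 0.1, 2: 0.1, 3: 0.4}
Z = [0, 1, 2, 3]
hit = lambda y, z: 1 if y == z else 0
Y, u_best = argmax_set(Z, lambda y: expected_utility(y, Z, lambda s: p[s], hit))
```

A uni-modal optimiser would return only one of the two modes here, whereas the set Y keeps both equivalent predictions.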

Once the prediction sets Y(x) and the optimal utility 
values are computed it is possible to cluster them. Ap- 
proximate matchers might become necessary for checking 
the appropriate equivalence relations for each state defin- 
ition, due to numerical precision, limited data size, etc. 

False positives occur when $x_1$ and $x_2$ are clustered
together although they are mathematically not equivalent;
false negatives occur when the points are in different states
although they should not be. These risks are minimised by
providing more sample points to look at and by increasing the
data size. In the limit of an infinite number of data points and
samples, consistency is determined by the chosen
approximate matchers (ex: the Bhattacharyya distance in the
previous section) and by whether or not the data respects
the mathematical assumptions needed for the theory to
work (ex: conditional stationarity of the $p(Z|x)$).

Additionally we can exploit the fact that causal states
sub-partition the decisional ones. If we compute the
iso-utility, iso-prediction and decisional states first we might
then restrict the search for similar probability
distributions $p(Z|x_1)$ and $p(Z|x_2)$ to points $\{x_1, x_2\} \subset \omega$ within
each decisional state $\omega$.

If we compute the causal states first (as in Fig. 3),
we might use representative $p(Z|\sigma)$ distributions for each
causal state, by averaging all $p(Z|x)$ for $x \in \sigma$ in order to
reduce numerical discrepancies: $p(Z|\sigma) = \operatorname{avg}_{x \in \sigma}\, p(Z|x)$.
As mentioned above, all distributions should be the same
mathematically but in practice the estimators are not
perfect. This step becomes an effective way to reduce the
discrepancies. The expected utility $U(y|x)$ is then set to
$U(y|\sigma)$ for all $x \in \sigma$. This approach (causal states first)
was found to give better results in practice.
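The averaging step can be sketched as follows, with distributions stored as lists of values at common sample points. The unweighted mean and the names used are assumptions; a count-weighted mean would be an equally plausible reading of "avg".

```python
from collections import defaultdict

def average_per_state(dists, state_of):
    """p(Z|sigma) as the element-wise mean of p(Z|x) over x in sigma."""
    members = defaultdict(list)
    for x, dist in dists.items():
        members[state_of[x]].append(dist)
    return {sigma: [sum(col) / len(group) for col in zip(*group)]
            for sigma, group in members.items()}

# Hypothetical estimates: x1 and x2 were clustered in state A, x3 in state B.
dists = {"x1": [0.70, 0.30], "x2": [0.74, 0.26], "x3": [0.10, 0.90]}
state_of = {"x1": "A", "x2": "A", "x3": "B"}
representative = average_per_state(dists, state_of)
```

The representative distribution of each state is then used in place of the individual estimates when computing expected utilities.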

The chosen clustering algorithm must also handle the
unavoidable errors in the estimation of the clustered
quantities from finite data: probability distributions
$p(Z|x)$ for causal states, and some tolerance for
floating-point utility value comparison. Ideally, all x with exactly
equal such quantities should be put together by definition
of the equivalence relations for each state (see Section 3.2).
However due to the estimation errors some tolerance has
to be given for the equality, which then introduces
side-effects like the loss of the transitivity relation, etc.

A single-linked hierarchical clustering is then performed 
by default: this is equivalent to finding connected com- 
ponents with respect to the given match predicate. The 
reasons for this choice are: 

• Occam's razor: we want to find the simplest model
able to handle the data. Connected components
maximise the cluster size by gathering data when a
matching path is found between them. This leads to
a minimal number of states in the discrete case while
ensuring that data in different states do not match
(consistency), hence minimal statistical $C = H(\sigma)$ or
decisional $D = H(\omega)$ complexity values.

• An interpretation for the continuous case. Connec- 
ted components ensure d(a, b) > A for a and b points 
in different clusters, d a dissimilarity measure (ex: in 
utility values for iso-utility states, on probability dis- 
tributions for causal states, etc), and A a threshold 
for the mismatch between clusters. This is equivalent 
to single-linked hierarchical clustering where we cut 
the hierarchy at level A. In the continuous case the 
transition graph construction fails on a continuum of 
infinitely many nearby states. In that case connected 
components with threshold A ensure that the system 
state changed at least by that amount when transiting 
from one point to the next in a different component. 
For example when monitoring a system expected util- 
ity value, a decision might be taken only when a sud- 
den change is detected, but not for a gradual change 
of the same magnitude. 
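Finding connected components with respect to a match predicate can be sketched with a union-find structure. This is an illustrative brute-force version that tests all pairs; the function names are assumptions and the reference implementation may organise the comparisons differently.

```python
def connected_components(items, match):
    """Single-link clustering: union items whenever match(a, b) holds."""
    parent = list(range(len(items)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            if match(items[i], items[j]):
                parent[find(i)] = find(j)

    clusters = {}
    for i, item in enumerate(items):
        clusters.setdefault(find(i), []).append(item)
    return list(clusters.values())

# Hypothetical utility values matched with a threshold of 0.5.
utilities = [0.0, 0.4, 0.8, 3.0]
clusters = connected_components(utilities, lambda a, b: abs(a - b) <= 0.5)
```

Here 0.0 and 0.8 do not match directly but are linked through 0.4, which illustrates how the non-transitive match predicate is turned into a proper partition.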

5.5 Ensuring the e-machine determinism 

Causal State Splitting Reconstruction (CSSR) [19] is the
reference algorithm for reconstructing e-machines on
discrete strings of symbols. It works by recursively splitting
the current causal state estimates as the string length is
increased. The consistency on shorter string lengths is
maintained while the causal states are refined to take into
account more symbols. In the limit, it provably converges
to the true causal states.

In the present case we do not act on strings of symbols
but on (x, z) mappings. Hence it is not possible to
refine iteratively the current causal state estimates by
enlarging the dimensions of X and Z. Yet the "symbols"
of discrete data are implicitly present in the $(x_i, x_{i+1})$
transitions when monitoring the system and the order of
the data presentation matters (the index i then
corresponds to ordered time steps). It would be possible to
recover a symbolic representation of the data set from all
such transitions, and apply CSSR if so desired. Here we
directly cluster the system configurations $x \in X$, not
necessarily represented as strings of symbols. For example,
each $x \in X$ might correspond to a past light cone (see
Appendix A).

The drawback is that the proposed algorithm does not
so far ensure that the resulting automaton is deterministic
in terms of symbol transitions. The labelled transitions
between states can be recovered by looking at the symbol
suffix implied by passing from $x_i$ to $x_{i+1}$. But there
is no guarantee that a given (state+symbol) combination
always leads to the same state deterministically.

Example: Suppose that $x_i = aaba$ and $w_i = abba$ are
in the same causal state: $p(Z|x_i)$ and $p(Z|w_i)$ match and
were clustered together with string length limited to depth
4. We observe that $x_{i+1} = abac$ and $w_{i+1} = bbac$, with
the same suffix c, yet $p(Z|x_{i+1})$ and $p(Z|w_{i+1})$ do not
match anymore and therefore were not clustered in the
same state. This is a violation of the e-machine
determinism: from the same state and with the same symbol,
the transition leads to different states. Yet this case is
possible when clustering independently $x_i$, $w_i$, $x_{i+1}$, $w_{i+1}$
into their own states as we do.
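Such violations can be detected directly from the labelled transitions, as in the following sketch. The data layout (triples of configuration, symbol, next configuration) and the state labels S0, S1, S2 are hypothetical choices for this illustration.

```python
from collections import defaultdict

def determinism_violations(transitions, state_of):
    """Map each (state, symbol) pair to its target states when there are several.

    `transitions` holds (x, symbol, w) triples; `state_of` maps a configuration
    to the label of the state it was clustered into.
    """
    targets = defaultdict(set)
    for x, symbol, w in transitions:
        targets[(state_of[x], symbol)].add(state_of[w])
    return {key: states for key, states in targets.items() if len(states) > 1}

# The example above: aaba and abba share a state, their successors do not.
state_of = {"aaba": "S0", "abba": "S0", "abac": "S1", "bbac": "S2"}
transitions = [("aaba", "c", "abac"), ("abba", "c", "bbac")]
violations = determinism_violations(transitions, state_of)
```

An empty result means that every (state, symbol) pair leads to a single state, which is the determinism property required of an e-machine.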

For iso-utility, iso-prediction and decisional states this is
not a problem: as explained in Section 3.4, transitions are
determined in terms of changes in utility related
quantities, and the string symbols are irrelevant in that case. For an
e-machine reconstruction however the proposed algorithm
needs to be augmented by an additional step.

The user can optionally express symbol values together
with each $x_i \to x_{i+1}$ transition. These are used as
constraints for the clustering algorithm when they are
available. The following procedure is implemented:

• Before clustering: If several transitions
$x_i = x \xrightarrow{a} x_{i+1} = w$ are observed, with the same configuration
value $x \in X$ and symbol $a \in A$, then all corresponding
$w \in X$ are pre-clustered in the same state.

• Clustering is performed as described in Section 5.4.

• After clustering, iterate the following steps:

  - Split step: It might be that data $x_1$ and $x_2$
  for the transitions $x_1 \xrightarrow{a} w_1$ and $x_2 \xrightarrow{a} w_2$
  were clustered together, while $w_1$ and $w_2$ are
  not clustered together (see the above example).
  In that case, the state containing $x_1$ and $x_2$ is
  split in order to restore determinism.

  - Merge step: As before clustering. Note that
  splitting acts on the states of mismatching
  configurations before a transition, while merging
  acts on the states of mismatches after a transition,
  so both can be applied without undoing
  each other.

  - Break the loop when incompatible constraints
  prevent convergence.

Convergence of the loop would effectively ensure determ- 
inism of the reconstructed automaton in the perfect case 
where all distribution estimations are exact. 

Unfortunately this is not the case in practice. Indeed,
clustering from finite data is necessarily imperfect. If
$x \in \sigma_2$ is wrongly assigned to causal state $\sigma_1$ then forcing
symbol determinism might create spurious states: $\sigma_2$ is
erroneously split until the transitions are consistent, while
the source of the inconsistency is not detected. Similarly,
states may be merged when they should not be.

We had to accept a threshold for clustering distributions
together (ex: a significance level for the Chi-Square test),
due to the imperfect distribution estimation. In turn, we
have no choice but to accept that some $x \in \sigma$ might be
misclassified and might generate spurious transitions. In the
same way that we ignore small discrepancies in distribution
clustering, the solution is to ignore small discrepancies in
the automaton determinism. Formally:

Let $\sigma$ be a causal state, $a \in A$ a symbol in the alphabet
A. The automaton is deterministic when each time a data
value $x \in \sigma$ is followed by the symbol a then $w = xa$ falls
in a unique causal state $\varphi$, $\forall x \in \sigma$. When the automaton
is not deterministic there is instead a distribution $p(\varphi|\sigma, a)$
over the candidate next causal states $\varphi$.

We propose here to set a threshold $\theta = 1 - \epsilon > 1/2$ for
ignoring small discrepancies up to $\epsilon$: when there exists $\varphi$
such that $p(\varphi|\sigma, a) > \theta$, the unique such $\varphi$ is taken as the
automaton transition. This is completely independent from the
probability of the transition itself $p(a|\sigma)$.

Concretely, the split and merge steps described above are
applied only on such transitions $\varphi$, ignoring the spurious
transitions.
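The selection of dominant transitions can be sketched as below, using empirical frequencies as the estimate of the distribution over next states. The function name, the data layout and the default value of `theta` are assumptions for this illustration.

```python
from collections import Counter, defaultdict

def dominant_transitions(transitions, state_of, theta=0.95):
    """Keep, per (state, symbol), the target whose empirical share exceeds theta."""
    if theta <= 0.5:
        raise ValueError("theta must exceed 1/2 so the dominant target is unique")
    counts = defaultdict(Counter)
    for x, symbol, w in transitions:
        counts[(state_of[x], symbol)][state_of[w]] += 1
    kept = {}
    for key, targets in counts.items():
        target, n = targets.most_common(1)[0]
        if n / sum(targets.values()) > theta:
            kept[key] = target
    return kept

# Hypothetical data: 99 transitions A -> B and one spurious A -> C on symbol 1.
state_of = {"a1": "A", "b1": "B", "c1": "C"}
transitions = [("a1", 1, "b1")] * 99 + [("a1", 1, "c1")]
kept = dominant_transitions(transitions, state_of, theta=0.95)
```

Pairs for which no target exceeds the threshold are simply absent from the result, so the split and merge steps would not be driven by them.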

5.6 Complexity of the algorithm 

Depending on the user context, one or the other of these 
tasks might become the dominant algorithm cost: 

• Estimating the probability distributions $p(Z|x)$. Using
the above Kernel Density Estimation the complexity
is roughly $O(N(M + q(N)))$ with N the number
of data points and M the number of samples $s \in S \subset Z$ at
which $p(Z = s|x)$ is estimated. $q()$ is the cost of
performing a nearest neighbours query in the joint space
(so negligible kernel values are quickly eliminated; the
worst-case limit of $q = O(N)$ is the summing of all
kernel values at all data points).

• Clustering tasks. In particular comparing probability 
distributions is more costly than comparing points, so 
clustering causal states is usually expensive. Finding 
connected components might then use up to O(KN) 
calls to the dissimilarity measure, with K the final 
number of clusters. 

• Evaluating the utility function. For the analytical
examples in Section 4, $U(y, z)$ is simple enough so its
evaluation was not the main issue. However in a
different scenario the algorithm complexity might have
to be defined in terms of the number of evaluations
of the cost function.

• Optimising $U(y|x)$ in order to find
$Y(x) = \operatorname{Argmax}_{y \in Z} U(y|x)$. An exhaustive search of Z is only
feasible for small discrete spaces. Advanced
multi-modal optimisation techniques might become necessary
and induce large computation times.

Finally the memory requirements for running these
computations might exceed by far the current processing
capabilities if care is not taken. For example it might not be
possible to store all $p(Z|x)$ distributions for each unique
$x \in X$ present in the data set, especially if a large number
of samples $s \in Z$ are needed (ex: the Monte-Carlo
sampling error decreases as $O(1/\sqrt{|S|})$).



• The clustering algorithm finds the connected components
(see Section 5.4). Symbol constraints are available
and implemented as described in Section 5.5 with
a tolerance threshold $\theta = 0.95$.



6 Application examples 

6.1 Hidden model reconstruction of the 
Even process 

This first application example demonstrates the capabil- 
ity of performing an e-machine reconstruction. Since the 
e-machine is the minimal and optimal deterministic auto- 
maton for reproducing a process statistically, and since the 
decisional states transition graph is a sub-machine of the
e-machine (see Section 3.5), the proposed algorithm needs
to perform well on this task. Moreover recovering a hidden
generative model from symbolic observations is a common
task for many practical applications, as in [16] where a
comparison is explicitly made with e-machines and finite
memory Markov models.

The Even process is used as a benchmark in [10]. A similar
experiment is conducted with the proposed algorithm
for comparison.

The Even process consists of two states, and generates
binary strings where blocks of an even number of 1s are
separated by an arbitrary number of 0s. Despite its
apparent simplicity the Even process does not correspond to
any finite-length Markov chain [16], and requires the power
of an e-machine to be reconstructed. Figure 4 shows the
process states and transitions.



Figure 4: Definition of the Even process.

Data is generated according to the Even process as a 
series of symbols. The goal of the experiment is to recon- 
struct the underlying transition graph from these obser- 
vations. 
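A generator for such a series can be sketched as follows. The state names A and B follow the discussion of the reconstruction; the probability 1/2 of emitting a 0 from state A is an assumption (consistent with the value p = 0.501 reported for s = 0 in the reconstruction), as are the function name and the seed.

```python
import random

def even_process(n, p_zero=0.5, seed=42):
    """Generate n symbols of the Even process.

    State A emits 0 and stays in A, or emits 1 and moves to B;
    state B always emits 1 and returns to A, closing an even block of 1s.
    """
    rng = random.Random(seed)
    state, out = "A", []
    for _ in range(n):
        if state == "A":
            if rng.random() < p_zero:
                out.append(0)
            else:
                out.append(1)
                state = "B"
        else:
            out.append(1)
            state = "A"
    return out

series = even_process(100000)
fraction_of_ones = sum(series) / len(series)
```

Under the assumed probability the stationary distribution is p(A) = 2/3 and p(B) = 1/3, so the fraction of 1s should be close to 2/3; the last block of the series may be truncated to an odd length.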

The algorithm described in Section [5] is set up with the 
following parameters: 

• System configurations $x \in X$ are taken as the symbols
in a sliding window of size L past data values. The
predictions $z \in Z = \{0, 1\}$ are the symbol in the series
following this window.

• Discrete distributions are built by monitoring (x, z)
pairs in the training set of N generated
associations.

• A Chi-Square test is used in order to match
distributions, at the 5% significance level.




• We are not interested in this example in decisional 
states, so we do not set a utility function. 
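The first three settings can be sketched as below for binary symbols. The two-sample Chi-Square test is written out by hand for this illustration, using the closed form of the survival function for one degree of freedom; reading the 5% level as the significance level `alpha`, and the function names, are assumptions.

```python
import math
from collections import Counter, defaultdict

def window_counts(series, L):
    """Count next-symbol occurrences z for each window x of the L preceding symbols."""
    counts = defaultdict(Counter)
    for i in range(L, len(series)):
        counts[tuple(series[i - L:i])][series[i]] += 1
    return counts

def chi_square_match(c1, c2, alpha=0.05):
    """Two-sample chi-square test on binary counts; True if not rejected at alpha."""
    rows = [[c1.get(0, 0), c1.get(1, 0)], [c2.get(0, 0), c2.get(1, 0)]]
    col = [rows[0][k] + rows[1][k] for k in (0, 1)]
    total = sum(col)
    stat = 0.0
    for r in rows:
        for k in (0, 1):
            expected = sum(r) * col[k] / total
            if expected > 0:
                stat += (r[k] - expected) ** 2 / expected
    p_value = math.erfc(math.sqrt(stat / 2))  # chi-square survival, 1 d.o.f.
    return p_value >= alpha

# Hypothetical short series with a window of L = 2 past symbols.
counts = window_counts([0, 1, 1, 0, 1, 1, 0, 0], 2)
same = chi_square_match(Counter({0: 500, 1: 500}), Counter({0: 510, 1: 490}))
```

The boolean returned by `chi_square_match` can serve as the match predicate given to the connected-components clustering.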



The result of one typical reconstruction, using $N = 10^6$
associations and a past window of 10 points, is shown in Fig.
5. The recurrent causal states of this e-machine correctly
correspond to the definition of the Even process. Close
inspection of the data shows that the transient state
corresponds to strings formed of 10 symbols 1 in a row. Due
to the limit in window size the algorithm cannot
distinguish whether the last symbol 1 was emitted from
recurrent state A or B. Logically, it observes that 1/3 of the time
the next symbol is a 0 in the data set and 2/3 of the time
it is a 1, matching the proportions of the symbols in the
data set: $p(s = 1) = p(s = 1|A)\,p(A) + p(s = 1|B)\,p(B)$, as
the process is really in either the state A or the state B.



Edge labels legible in the figure: s = 1, p = 0.677; s = 0,
p = 0.501; 0.323 (recurrent states marked).

Parameters: $10^6$ data points, using a past window size of 10
points (default random seed 42).

Figure 5: Reconstruction of the Even process.



The proposed algorithm classifies every single training 
point in a causal state, hence creates transient states if ne- 
cessary to match data. The underlying process recurrent 
states can be found from the strongly connected compon- 
ent of the e-machine graph. Note that despite the Even 
process not being equivalent to any finite-length Markov 
chain, the proposed algorithm reconstructs it fairly well 
with a window size of 10. 

In [19] an experiment is conducted to study the behaviour
of the CSSR algorithm depending on the history
size. The transposition of this experiment is conducted in
the current framework in order to highlight the differences
between both algorithm behaviours.
Figure 6: Average number of recurrent states reconstructed
from the Even process, as a function of the number of points
retained from the past.

Figure 6 shows the result of that experiment: how many
recurrent states are found on average (over 30 independent
trials) by the proposed algorithm, depending on the
window size. This diagram is directly comparable to [19,
Fig. 4].

• Results for L = 1 and L = 2 produce an incorrect
transition graph: there is not enough history to reliably
determine the states. The two states for L = 1
in Fig. 6 are thus not the correct ones.

• CSSR may split the states with each increase in the
window size, whereas the present algorithm clusters
states using the whole window and symbol constraints.
When there is not enough data to estimate
the distributions properly CSSR over-splits the
states. The current algorithm merges them thanks to
the tolerance on state transition determinism as
explained in Section 5.5. In either case the underlying
problem remains the same: distributions over large
L cannot be correctly estimated from a small number
of observations. As the number of states determines
the estimated statistical complexity, the user can
select which algorithm to apply depending on whether
over- or under-estimated complexities are preferred.

• When there is enough data to estimate the distribu- 
tions correctly the proposed algorithm gets more pre- 
cise as the window length increases. 

6.2 Cellular automaton 

Another test case where causal states were applied is the
detection of moving particles in cellular automata, and
their interactions [18]. The introduction of a utility
function in this context provides a simple yet effective way
to demonstrate the concepts presented in this document.
The next section considers the usage of decisional states
in a larger-scale application.



Legend: Past, Present, Future.

Evolution rule (ex: rule "110"): cells become full or empty
according to the immediate past state of the rule support.

Figure 7: Elementary cellular automaton.

Figure 7 shows the evolution rule and configuration of
an elementary one-dimensional automaton. Each row of
the regular grid contains the system state at a given time.
An evolution rule dictates how the cell binary states evolve
at each time step. The evolution rule has a support, 3
cells in Figure 7, from which the next cell configuration
is deduced. Propagation of this support in time defines
"light-cones" according to the terminology of Appendix A,
within the implementation constraint of a limited depth.
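For reference, one update of such an automaton with a 3-cell support and cyclic boundaries can be sketched as below, using the standard Wolfram rule numbering. The function names are illustrative choices, not taken from the reference implementation.

```python
def eca_step(cells, rule=110):
    """One update of an elementary cellular automaton with cyclic boundaries.

    The (left, centre, right) triple is read as a 3-bit number that selects
    one bit of the rule number; cells[i - 1] wraps around for i = 0.
    """
    n = len(cells)
    return [(rule >> (4 * cells[i - 1] + 2 * cells[i] + cells[(i + 1) % n])) & 1
            for i in range(n)]

def eca_field(cells, steps, rule=110):
    """Rows of the space-time field, one row per time step (initial row included)."""
    rows = [list(cells)]
    for _ in range(steps):
        rows.append(eca_step(rows[-1], rule))
    return rows

field = eca_field([0, 0, 0, 1, 0], steps=3)
```

Past and future light-cones are then read off this field by following the support backwards and forwards in time from a given cell.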

In this context the data space X is the space of all past
light-cones (in blue on Figure 7). From the current system
state we would like to predict the future of the system, so
Z is the space of all future light-cones (in red on Figure
7). Even though the cellular automaton is completely
deterministic, the state of cells in the future cone depends
on information which is outside the past cone, so we
observe a distribution of different futures for each past. See
also Appendix A. With cyclic boundary conditions and
a fixed evolution rule for the whole automaton, all cells
have exactly the same distribution so we can aggregate
the observations across all cells.

The utility function is chosen by the user according to
the application needs. Here we chose to define the utility
of a prediction as the number of correctly predicted cell
states in the future cone. Hence the utility takes in this
example integer values between 0 and the maximum $d - 1$
where d is the future cone depth (we could also have used
a proportion between 0 and 1).

Given the discrete nature and relatively small search
space of the problem, the algorithm described in Section
5.1 is set up with:

• A simple discrete probability density estimator based
on (x, z) observation counts:
$p(z|x) = \mathrm{count}(x, z) / \mathrm{count}(x)$.

• An exhaustive integrator weighting the utility of all
possible future cones by their probability:
$U(y|x) = \sum_{z \in Z} p(z|x)\, U(y, z)$. In practice unobserved z values
would induce a null contribution so the summing
occurs only on observed z.

• An exhaustive search optimiser, computing $U(y|x)$ for
all possible $y \in Z$. The maximum utility value as well
as the set $Y(x) = \operatorname{Argmax}_{y \in Z} U(y|x)$ of best
predictions are maintained during the search.




Top-left: Raw cellular automaton field. 

Top-right: Statistical complexity field (difficulty to get the fu- 
ture distribution). 

Bottom-left: Iso-Utility complexity field (difficulty to get the 
maximal expected utility). 

Bottom-right: Iso-prediction complexity field (difficulty to get 
an optimal prediction). 

Parameters: Past depth 6, future depth 4, 400 cells, 300 steps, 
100 initial transient dropped, default random seed 42. 

Figure 8: Raw cells and complexity fields of a cellular 
automaton. 



• The connected component, single-link hierarchical
clustering algorithm described in Section 5.4 with
exact match predicates.



Figure 8 shows the results of the experiment for the rule
"110" introduced in Figure 7. It can be compared with
[15, Fig. 3]. The raw cellular automaton field is the
direct application of this rule. The statistical, iso-utility
and iso-prediction complexity fields are mapped to a grey
scale range where white represents their respective
minimum complexity value and black their respective
maximum.

As expected the causal states sub-cluster the iso- 
predictive and iso-utility states: we observe alternative 
versions of the particles. The utility is based on the num- 
ber of correctly predicted cells, irrespectively of their po- 
sition in the future cone. Information that is irrelevant 
to this utility function is masked out in the iso-prediction 
field, whereas it was present in the statistical complexity 
field. 

Some information was lost. But if all the user cares about is
encoded in the utility function, that information was noise
and clarity was gained in the result. In extension to [18],
we have defined a new family of automatic filters based on
utility functions.



Legend: x: neighbourhood; z: point to predict.

Figure 9: Neighbourhood for image filtering. 

6.3 Image filtering and edge detection 

The idea of this section is to extend the cellular auto- 
maton example for filtering images. We make the hypo- 
thesis that edges correspond to zones where the predic- 
tion difficulty is greatest. This differs from other common 
definitions, like a high luminance gradient magnitude. The 
background pattern of the previous cellular automaton ex- 
ample is a case where gradient-based filtering would detect 
edges while the proposed method assigns a low-complexity 
value, which may or may not be better suited to a user's
problem. The definition of an edge is not the topic of this 
article. This section's goal is to show how the concepts 
introduced in this document might be used on a concrete 
non-temporal data example. 

The family of filters created by statistical or decisional
complexity has unusual properties:

• They are defined globally on the whole image (or image
sequence), but they are applied locally (for each
considered $x \in X$).

• They detect zones which are statistically different 
from the rest of the image. This presents an interest 
in itself, especially if the user is able to provide an 
adequate utility function. 

This framework may possibly be adapted for generic fea- 
ture detection with a variant setup. In this example scen- 
ario we are considering edge detection. 

Following the construction in [1], the data space X is
defined as the neighbourhoods of image pixels $z \in Z$.
The prediction problem is to find the value of z from the
neighbourhood. Figure 9 shows how the neighbourhood is
defined in this experiment: up to two pixels in each
direction except corners. Larger or smaller neighbourhoods
were tested: smaller regions make the prediction more
difficult, larger regions lead to thicker detected edges. The
adequate size of the neighbourhood depends on the
application. Similarly, defining z as a centre block instead of
a single pixel was also tried, with similar observed effects
(precision and edge size). For an 8-bit grey image the data
space is thus $X = [0 \ldots 255]^{20}$ and the prediction space
$Z = [0 \ldots 255]$.

The data space X is thus considerably larger than in
the previous examples. Fortunately, unlike the cellular
automaton case where a difference in x can lead to
completely different predictions, usually images are not
significantly altered when pixel values differ by a small amount.
In the present context we exploit this nearby consistency
in X and Z in order to apply kernel density estimators.
These are more reliable than the simple count-based
estimator used in the previous section, especially since X
has a higher dimension.

The approach of considering the image as the limit
distribution of a Markov Random Field [1] is applied to this
example. Concretely, this amounts to estimating the
probability densities in the joint space $X \times Z$ and inferring the
conditional distributions by integration of $p(X)$, as
described in Section 5.3.

The prediction space Z can be run through exhaust- 
ively in the case considered here: only one centre grey 
pixel. Sub-sampling for numerical integration is thus not 
necessary, and we set S = Z. 
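
The estimation step can be sketched as follows in Python. This is a minimal illustration under stated assumptions, not the reference implementation: it uses a product Gaussian kernel with a single bandwidth h shared by X and Z, it enumerates every grey level of Z exhaustively (the S = Z setting above), and the function names are hypothetical. The kernel normalisation constants cancel in the final division, so they are omitted.

```python
import math


def gaussian_kernel(sq_dist, h):
    """Unnormalised Gaussian kernel evaluated on a squared distance."""
    return math.exp(-0.5 * sq_dist / (h * h))


def conditional_distribution(samples, x_query, h, levels=range(256)):
    """Estimate p(z | x_query) for every z in `levels`.

    samples: list of (x, z) pairs, x a tuple of grey levels, z a grey level.
    The joint density at (x_query, z) is a kernel sum over the samples;
    dividing by its sum over all z plays the role of the integration
    that yields p(x_query)."""
    # Contribution of each sample in the X part of the joint space.
    weights = []
    for x, z in samples:
        sq = sum((a - b) ** 2 for a, b in zip(x, x_query))
        weights.append((gaussian_kernel(sq, h), z))
    # Joint density (up to a constant) for each candidate value of z.
    joint = [sum(w * gaussian_kernel((z - zs) ** 2, h) for w, zs in weights)
             for z in levels]
    total = sum(joint)
    if total == 0.0:
        # No sample is close enough: fall back to a uniform distribution
        # (a choice made for this sketch only).
        return [1.0 / len(joint)] * len(joint)
    return [v / total for v in joint]
```

Samples whose neighbourhood is far from x_query receive a negligible weight, which is how the nearby consistency in X and Z is exploited.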

The utility of a prediction y when the true value z
happens is defined as U(y, z) = -max(0, |y - z| - t). In
other words, we accept small prediction discrepancies up
to t at no cost, reflecting the fact that the image is not
significantly altered by small variations in pixel grey levels.
Beyond t, the utility decreases (the loss increases) with each
grey level difference between the predicted and the true
value.
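
The utility above and the resulting optimal predictions can be written down directly. The following Python sketch assumes that the conditional distribution p(Z|x) is available as a list indexed by grey level. The function names and the tolerance used to collect tied maxima are hypothetical.

```python
def utility(y, z, t):
    """U(y, z) = -max(0, |y - z| - t): no cost up to t, linear loss beyond."""
    return -max(0, abs(y - z) - t)


def expected_utilities(p_z, t):
    """Expected utility of each candidate prediction y under p(z | x)."""
    return [sum(p * utility(y, z, t) for z, p in enumerate(p_z))
            for y in range(len(p_z))]


def optimal_predictions(p_z, t, tol=1e-12):
    """Return the maximal expected utility and all predictions reaching it."""
    scores = expected_utilities(p_z, t)
    best = max(scores)
    return best, [y for y, s in enumerate(scores) if best - s <= tol]
```

For a distribution concentrated on z = 3 and t = 1, the predictions 2, 3 and 4 all reach the maximal expected utility of zero, which shows how the tolerance t makes several predictions equivalent.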

In a first experiment we compute the decisional
complexities of each pixel, and plot these values on a grey
scale map. The bottom row of Figure [10] corresponds to
this scenario. In a second experiment a pre-processing step is
performed. For each region x we compute the minimal
and the average grey level. Either of these can be
subtracted from both x and z without loss of generality (we
would still be able to reconstruct the original grey level
for a prediction z by adding back the value shift defined
on x). In practice it was observed that subtracting the
minimal value of the neighbourhood leads to better
results than subtracting the average. The middle row of
Figure [10] uses this pre-processing. Finally the best results
are obtained when ordering the states by their complexity
values and plotting the ranks of the states instead of
the complexity values themselves on a grey scale. The left
image on the top row shows the effect of this transform,
to be compared to the left image on the middle row. The
result of a Sobel filter⁴ is presented for comparison, as well
as the original image. All filter results are normalised⁵ in
grey scale range.
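
The two transforms described in this paragraph are simple enough to sketch in Python. This is an illustration under assumptions rather than the reference implementation: the rank transform below ranks distinct complexity values, which coincides with ranking the states only when distinct states have distinct complexities, and the rescaling to [0, 255] and the function names are hypothetical.

```python
def shift_by_minimum(x, z):
    """Subtract the minimal grey level of the neighbourhood x from x and z.

    The shift is returned so that the original grey level of a prediction
    can be recovered by adding it back."""
    shift = min(x)
    return tuple(v - shift for v in x), z - shift, shift


def rank_transform(values):
    """Replace each complexity value by the rank of its distinct value,
    rescaled to the grey scale range [0, 255]."""
    distinct = sorted(set(values))
    rank = {v: i for i, v in enumerate(distinct)}
    top = max(len(distinct) - 1, 1)
    return [round(255 * rank[v] / top) for v in values]
```

Because ranks depend only on the ordering, a few very large complexity values no longer compress the rest of the grey scale.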

Compared to the Sobel filtering the proposed technique
has several distinctive characteristics:

• The filter is defined globally: features on the whole
picture are taken into account. For example the flat
zones in the original image are detected as having low
complexity. When the filter is applied locally this
information is taken into account. While the Sobel filter
is sensitive to noise in the local grey level gradients,
the proposed filter assigns a low complexity to similar
regions. The extreme example would be the cellular
automaton background in Figure [8]: the Sobel filter
would detect many edges for each small triangle
pattern, while the proposed filter assigns a low (blank)
value for these patterns.

• Fine details are similarly considered statistically on
the whole picture. In the bottom-left region of the
picture, the filaments attached to the hat are detected
as single units: each light-dark transition has its
own complexity, and low values are whitened out by the
ranking. The Sobel filter detects two transitions, one
on each side of the filament, according to grey level
differences only. Note that this is not a perfect effect:
some filaments (on top of the hat light background)
are also detected as two transitions with the proposed
filter, depending on the global statistical properties of
each involved light-dark transition. Yet in general the
proposed algorithm produces finer details and fewer
artifacts.

• In a different image where each light/dark edge
pattern would occur with a different frequency the
proposed algorithm would produce a different result
locally, while the Sobel filter would be unaffected by the
rest of the picture. This might be an advantage or
not, depending on the application.

• Computing global statistics and clustering probability
distributions comes with a computational cost. While
Sobel filtering is very efficient, applying the proposed
algorithm requires comparatively large computation
times (in the order of one to two hours with a dual-core
3.16GHz Intel CPU).

4 Obtained with The Gimp, http://www.gimp.org/
5 Using the ImageMagick "normalize" filter, http://www.imagemagick.org

The effect of increasing the kernel size is shown in the 
middle and bottom rows. A small Gaussian kernel size
effectively induces poor generalisation (overfitting in
another context). This is apparent in the form of a noisy
background. The effect of using a rank-based representa- 
tion attenuates this noise while keeping the high-frequency 
edge components that have high complexity values (top- 
left vs middle-left images). Larger kernel sizes smooth 
more and more details. In the bottom row, with no pre- 
processing, this effect helps to obtain large regions with 
approximately the same complexity. When the kernel size 
is too high black speckles appear on the image, correspond- 
ing to spurious (x, z) joint distributions that are unique to 
this pixel and not found in the rest of the picture. 

Finally, no additional well-known image-processing
techniques were applied to the images, including no
hysteresis thresholding, no double-edge detection, and no
smoothing/multi-resolution treatment. It would be
interesting to combine the complexity-based filters with other
well-established edge detection, segmentation or noise
removal algorithms. The goal of this article is to introduce
the decisional states and their applications in different
contexts. The proposed edge detection is only a
demonstration of how the main concept might be used in practice.

Top: Left: Decisional complexity ranks for h = 0.75, t = 5 (preprocessed, normalised). Middle: Result of a Sobel
filter (normalised). Right: The Lenna benchmark image (grey scale, unnormalised).

Middle: Left: Decisional complexity values for h = 0.75, t = 5 (preprocessed, normalised). Middle: Decisional
complexity values for h = 5, t = 15 (preprocessed, normalised). Right: Decisional complexity values for h = 7, t = 17
(preprocessed, normalised).

Bottom: Left: Decisional complexity values for h = 5, t = 15 (normalised). Middle: Decisional complexity values for
h = 7, t = 17 (normalised). Right: Decisional complexity values for h = 10, t = 20 (normalised).

Figure 10: Proposed image filter (top left) and some variants.



7 Conclusion 

We derived the decisional states notion for a system: the
internal states that lead to the same decision, given a
user-defined utility function. Compared to alternative
approaches in the domain [10] [53], here the utility function is
defined on the space of predictions: U(y, z) quantifies what
gain/loss is incurred when y is predicted while z happens.
This makes the present work suited for applications like
time series processing and detection of anomalous/more
complex zones in a system, while less suited for
reinforcement learning [22].

The framework chosen for applying the utility function
is the ε-machine [5], which is a Markovian graph model
with higher genericity than finite-memory Markov chains
[16]. In this context the ε-machine corresponds to the
internal structure of the system, irrespective of any
user-defined utility. The decisional states are built on top of
this internal structure in a way that reflects the external
knowledge (encoded in the utility function) brought into the
system.

Coming with the decisional states are definitions of com- 
plexity measures on the system. It is possible to quantify 
precisely, in number of bits, the difficulty there is to make 
an optimal prediction in terms of the chosen utility. An- 
other consequence is a way to identify events that provoke 
a change of decision, represented as transitions in a state 
diagram, assuming decisions are based on the expected 
utility. 

A family of algorithms was introduced in order to
compute the decisional states from data. This family of
algorithms is adaptable to specific application needs through
the choice of its components: the utility function, a probability
density estimator, an integrator, a multi-modal optimiser, etc.
The resulting algorithm is usable both for computing the
ε-machine, and for computing the newly introduced decisional
states on top of it.

The decisional states were exemplified mathematically 
on analytically tractable examples, and numerically on 
practical problems like image filtering. The technique 
presented in this paper is generic and applicable to a wide 
range of topics. A reference implementation is provided, 
see Appendix C. This C++ code is highly adaptable as 
well as optimised for a variety of common cases. It is 
available as free-libre software. 



Appendix A: Causality, predictions 
and causal states in a physical con- 
text 

Let us consider what a prediction means in a physical 
framework, where information transfer is limited in speed. 
Fig. [11] displays a schematic view of a system's past and
possible future.




Figure 11: Light-cone representation of a prediction problem (axis label: Time)



In this view the system's present is a single point in
state-space. Contrast this with dynamical systems where the
present is the whole state vector, the line in Fig. [11]. Here
there is no instantaneous propagation of information, and only a
small portion of the state vector is accessible.

The past light-cone is the collection of all points that 
could possibly have an a priori causal influence on the 
present. The future cone is the collection of all points 
that might possibly be influenced by the present state. 
The problem is that in order to infer correctly the state 
of a point F in the future cone we might potentially need 
all points in the past light cone of F. It would theoretic- 
ally be possible to have access to the points like P in the 
system current past, provided in practice that we indeed 
recorded the value of P. However there is by definition 
no way of getting the value of points like O that are out- 
side the current system past light cone. Since both points 
belong to the past light cone of a point F in our own 
future, the consequence is that even for deterministic sys- 
tems we get a statistical distribution of possible futures 
for a given observed past, depending on what informa- 
tion present outside the current past cone is necessary to 
predict the future. In other words, boundary and/or con- 
ditions in inaccessible regions may determine part of the 
future, which is well-known in physics. 

Let us now consider grouping two past cones x_1^- and x_2^-
together if they lead to the same distribution of futures
p(X^+ | x_1^-) = p(X^+ | x_2^-). Suppose that point P in the past
has a distinct value for pasts x_1^- and x_2^-. Then there
is no way to recover what the value of P was by new
observations: we cannot use future knowledge to decide
between x_1^- and x_2^- since p(X^+ | x_1^-) = p(X^+ | x_2^-). For
all practical purposes these two pasts are then equivalent.
Mathematically the associated equivalence relation x_1^- ~ x_2^-
partitions the system past light cones into sets
σ(x^-) = {y^- : p(X^+ | y^-) = p(X^+ | x^-)}.
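
For a discrete symbolic series, the equivalence relation above can be illustrated with a short Python sketch. It makes several simplifying assumptions that are not part of the definition: pasts are truncated to a fixed length, futures are reduced to the next symbol only, distributions are estimated by simple counts, and exact equality is replaced by a tolerance with a greedy grouping. The function names are hypothetical, and this is not the algorithm of Section 5.

```python
from collections import Counter, defaultdict


def next_symbol_distributions(sequence, past_len):
    """Estimate p(next symbol | past) by counting, for pasts of fixed length."""
    counts = defaultdict(Counter)
    for i in range(past_len, len(sequence)):
        past = sequence[i - past_len:i]
        counts[past][sequence[i]] += 1
    alphabet = sorted(set(sequence))
    dists = {}
    for past, c in counts.items():
        total = sum(c.values())
        dists[past] = tuple(c[s] / total for s in alphabet)
    return dists


def group_pasts(dists, tol=0.05):
    """Greedily group pasts whose estimated distributions agree within tol."""
    states = []  # list of (representative distribution, member pasts)
    for past, d in sorted(dists.items()):
        for rep, members in states:
            if max(abs(a - b) for a, b in zip(rep, d)) <= tol:
                members.append(past)
                break
        else:
            states.append((d, [past]))
    return [members for _, members in states]
```

On the periodic series AABAAB..., the pasts AB and BA are both always followed by A and therefore fall in the same set, while AA is always followed by B and forms a set of its own.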

The sets σ are called the causal states of the system [17].
In a discrete scenario a new observation leads to a
transition from a state σ_1 to a state σ_2. The causal states and
their transitions form a deterministic automaton: the
ε-machine [8]. A neat result is the abstraction of the time
dependencies into the states. The transitions between states
include all dependencies from the past that could have
an influence on the future, hence the ε-machine actually
forms a Markovian automaton [17].

The causal states construction is not limited to light- 
cones as described above. We can also cluster together 
data (parameter) points x ∈ X according to the conditional
distributions p(Z|x) of points z ∈ Z in a space of
predictions. The same equivalence relation as above can 
be defined, except that now care must be exercised on the 
interpretation: all we have defined are internal states with 
the same predictive power, without referring to causality. 
For example, sneezing and coughing are good predictors 
for being ill, though they are the symptoms and not the 
cause of the illness. However, when the space X is restric- 
ted to past time (and Z to future time) as is the case in 
this section, it is a reasonable assumption that a causal 
relation indeed provides the desired predictive ability. We 
refer to the equivalence classes induced by the above rela- 
tion as causal states in this document, following the cur- 
rent usage of the term, but keep in mind that prediction 
and causation are different issues. 



Appendix B: ε-machines presentation
for HMM practitioners

In the present approach we model the data as being
produced by an ε-machine, on top of which we seek the
structure imposed by a user-defined utility function. The
result in the discrete case (see Section 3.4) is a
deterministic Markovian automaton, in the sense that transitions
between states do not depend on which other state was
previously visited.

The framework is thus that of determining hidden states and
their transitions from data, and the ε-machine is a class of
Markov model [IB]. In the present case however the
state-to-state transitions also correspond to symbol emissions,
so we are dealing with an edge-emitting Markov model
[13]. Fig. 12 shows the difference between the usual
Hidden Markov Models and the ε-machine.

Let us consider an alphabet 𝒜 = {A, B}. In the
Hidden Markov Model framework [15] the hidden states
S = {s_1, s_2, s_3} each have a different probability
distribution q(X|s_i) over X ∈ 𝒜 for emitting the symbols while
in state s_i. Transitions between states are determined
by probability distributions p(s_j|s_i) for reaching state s_j
while in state s_i. Symbols are produced according to q_i
while in state s_i, with possibly a state-to-state transition
according to p_i before symbol production.

Figure 12: Difference between usual Hidden Markov Models
and ε-machines (panels: Hidden Markov Model, ε-machine)

In the ε-machine framework each hidden state σ_i ∈ S
corresponds to a distinct conditional probability P(X^+ | σ_i)
of future strings x^+ ∈ X^+ ⊂ 𝒜*. Each state σ_i gathers
past strings x^- ∈ 𝒜* with the same conditional
probability of future observations. The states S form a partition of
all possible past strings. Symbols are produced on state
transitions: when a symbol is emitted, the updated past x^-_{t+1}
including the newly emitted symbol might fall in the same
or in another state σ_j ∈ S (see Fig. 12). The ε-machine is the
minimal deterministic automaton that is able to reconstruct
the statistical aspects of a process⁶.

The differences between Markov processes with finite
memory and ε-machines are expressly highlighted in [16]
in terms of shift spaces [12]: Markov processes with finite
memory cover finite-type shifts while ε-machines can
capture the structure of the more general sofic shifts. A yet
more generic type of model is proposed by [13], with a
tighter internal representation while retaining an optimal
generative power, but at the cost of losing the automaton
determinism.
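
To make the notion of an edge-emitting deterministic automaton concrete, the following Python sketch simulates the "even process", a standard two-state example of a sofic shift that is not of finite type. This example is not taken from the present article, and the data layout and function name are hypothetical. Symbols are emitted on transitions, and the next state is fully determined by the current state and the emitted symbol.

```python
import random

# Edge-emitting automaton: state -> list of (probability, symbol, next state).
# From state A a "1" leads to B, and B must emit a second "1" to return to A,
# so runs of "1" enclosed between "0" symbols always have even length.
EVEN_PROCESS = {
    "A": [(0.5, "0", "A"), (0.5, "1", "B")],
    "B": [(1.0, "1", "A")],
}


def generate(machine, start, length, rng):
    """Emit `length` symbols by following state transitions from `start`."""
    state, out = start, []
    for _ in range(length):
        r, acc = rng.random(), 0.0
        # Pick the outgoing edge whose cumulative probability exceeds r.
        for prob, symbol, nxt in machine[state]:
            acc += prob
            if r < acc:
                break
        out.append(symbol)
        state = nxt
    return "".join(out)
```

No finite window on the past suffices to decide whether the next symbol may be "0" after a long run of "1", since this depends on the parity of the run. The two states capture this parity, which is the kind of structure a finite-memory Markov chain cannot represent.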

In Markov Decision Processes (MDP) [10] [7], the
previous observations are used to build a model of the world
(possibly with hidden states). Actions are then chosen
according to this model in order to maximise the expected
utility that would result from that choice (including costs
and rewards). MDPs consider a feasible set of actions that
can be taken at any internal state, and the effect of these
actions. The utility can be expressed as a function of
histories U(x^-) or it can be expressed in terms of a
reward R(S = s) of being in state s and a cost function
C(S = s, A = a) of taking action a while in state s. In
[24] each hidden state of the Hidden Markov Model is
attributed a mean and variance utility of the process to
be in that state. This approach fares well when the utility
is itself a global quantity and may evolve over time.

6 Compare this with the Kolmogorov-Chaitin complexity, which
is defined in terms of a minimal program on a Turing machine that
is able to reconstruct the data exactly. Here we require only that
the reconstructed data have the same statistical properties as the
original data, and we are not dealing with exact reconstruction of
specific strings of events.

In the present framework, previous observations are also
used to build a model (the ε-machine). However, the
ε-machine captures equivalence classes of probability
distributions of futures by definition. Hence, all past
observations within the same state will lead to the same optimal
decisions (see Section 3.2): these decisions are taken based
on predictions of what will happen next. The utility is
a function defined by the user, and quantifies the
benefits/costs incurred by comparing what we thought would
happen (next observations, a future y^+) to what really
happens (the true future x^+).

An assumption is that the utility function encodes all
the information needed to take a decision: when both the
expected utility and the predictions are the same, we
assume the user takes the same decisions. This framework is
well suited to the scenario given in introduction, but
perhaps not so well suited to reinforcement learning [14] [22].
However the same formalism is applicable to any system
in which a prediction "usefulness" can be defined,
including classical loss/utility functions like the minimum sum of
squares error between the prediction and the actual future
(see Section 2).
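
The assumption stated above suggests a direct way of merging states, sketched here in Python for a finite prediction space. Everything in this block is illustrative: the function names, the representation of a state by a list of probabilities p(z | state), and the numerical tolerance are hypothetical, and the utility is passed in as an arbitrary function U(y, z).

```python
def expected_utility(y, p_z, utility):
    """Expected utility of predicting y under the distribution p_z."""
    return sum(p * utility(y, z) for z, p in enumerate(p_z) if p > 0)


def decision_signature(p_z, utility, digits=9):
    """Maximal expected utility together with the predictions reaching it."""
    scores = [expected_utility(y, p_z, utility) for y in range(len(p_z))]
    best = max(scores)
    argmax = tuple(y for y, s in enumerate(scores) if best - s < 10 ** -digits)
    return round(best, digits), argmax


def decisional_states(causal_states, utility):
    """Merge states that share both the maximal expected utility and the
    set of optimal predictions."""
    groups = {}
    for name, p_z in causal_states.items():
        groups.setdefault(decision_signature(p_z, utility), []).append(name)
    return sorted(sorted(g) for g in groups.values())
```

Two states with different distributions of futures can thus be merged when the utility cannot tell them apart, for example when a tolerance in the utility makes the same single prediction optimal for both.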

Table [1] recapitulates the differences between MDP and
Decisional States.

A framework that is closely related to reinforcement
learning and that also makes use of ε-machines has
recently been proposed [21]. It shows that a balance between
exploration and control emerges as a consequence of
using the ε-machine formalism without having to introduce
that balance explicitly. That framework also relates
learning with energy minimisation, and it makes explicit the
agent actions. The present approach differs in that it
introduces utility functions and considers that not all futures
are equivalent for the agent. It would be interesting
to try combining both approaches.



Appendix C: Web information 

The reference implementation of the algorithm
presented in this article is available on the author's web site:
http://nicolas.brodu.numerimoire.net

The code is highly templatised and the classes might be
directly included into a user project. The code is available
as free-libre software (GNU LGPL v2.1 or more recent)
and contributions are welcome.

The latest experimental version of the code as well as
any previous version are available at the source repository
at http://source.numerimoire.net/decisional_states



Observations

  MDP: Provided to the agent at each time step as a random
  variable O^t. The extent to which the observation includes the
  current system state (fully and partially observable cases)
  determines how well the agent can build a model of the world,
  and how well it can estimate the result of its actions.

  Decisional States: Provided as pairs of past and future
  histories (or light-cones, see Appendix A). An observation is
  o = (x^-, x^+). The agent has access to the current history x^-
  at each time. Internal states are not directly accessible to the
  agent: the information necessary to infer the system state from
  x^- is a characteristic of the system (see Section 3.4).

Utility

  MDP: Provided to the agent as the result of its actions. U(x^-)
  quantifies the utility of the history x^-. It is usually separated
  into a reward R(S = s) of being in state s and a cost function
  C(S = s, A = a) of taking action a while in state s.

  Decisional States: Encodes the knowledge of the experimentalist,
  the cost of making mistakes on predictions of the future.
  U(y^+, x^+) quantifies the utility of predicting the future y^+
  when x^+ really happens.

Actions

  MDP: A feasible subset of possible actions can be taken for each
  state. The effects of an action A^t = a at state S^t = s are
  described as P(S^{t+1} | s, a). Actions are chosen on the expected
  utility of the history that would result from this choice.

  Decisional States: Actions are implicit and need not be defined.
  We rather assume that the user acts based on predictions of the
  future (see the example in introduction). Actions are thus mapped
  to sets of predictions y^+ of the future. When both the maximal
  expected utility max_{y^+} E_{x^+}[U(y^+, x^+) | x^-] and the
  predictions {y^+} that reach this maximum are the same, we assume
  the same action will be taken. See Section 3.

Table 1: Differences between MDP and decisional states



The experiments presented in the main document are
reproducible with the version tagged ArXiv_v2.